{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<h1><center>Introduction</center></h1>\n",
    "# 1.Types of machine learning\n",
    "* Supervised learning\n",
    "* Unsupervised learning\n",
    "* Reinforcement learning\n",
    "\n",
    "**Supervised learning** The goal is to learn a mapping from inputs $\\vec{x}$ to ouputs $y$, given a labeled set of input-output pairs $D=\\lbrace\\left(\\vec{x}_i,y_i\\right)\\rbrace_{i=1}^N$. Here $D$ is called the **training set**.\n",
    "Each training input $\\vec{x}_i$ is a D-dimensional vector of numbers,which are called features. But in general , the input $\\vec{x}_i$ could be any complex structured object, such as an image, a sentence, an time series, etc.\n",
    "Each training output $y_i$ could also be any object, but most algorithm assuming that $y_i$ is a categorical label or real-valued scalar, which corresponding to classification problem and regression problem respectively.\n",
    "\n",
    "**Unsupervised learning** Unlike supervised learning ,in unsupervised learning, we only have inputs $D=\\lbrace \\vec{x}_i \\rbrace _{i=1}^N$. The goal of unsupervised learning was to discover interesting pattern in the inputs. So as to find the relationship between the different features of the inputs. Mathmetically speaking, we want to learn the joint distribution of different features of the inputs $p(\\vec{x}|D)$ from the training set.\n",
    "\n",
    "**Reinforcement learning** The goal of reinforcement learning was to learn to behave when given occasional reward.It's mostly used in game theory and robots design. We won't discuss it here.\n",
    "# 2.Supervised learning\n",
    "The supervised learning was the form of machine learning most widely used in practice. The supervised learning can be viewed as **function approximately**, we assuming the input $\\vec{x}$ and output $y$ satisfied some function $y=f\\left(\\vec{x}\\right)$. We use the training set to estimate the unknown parameters in the $f$, then we have our estimated model $\\hat{f}$. We can use the estimated model $\\hat{f}$ to predict the output of new input $\\hat{y}=\\hat{f}\\left(\\vec{x}\\right)$. This is called **generalize**.\n",
    "## Classification\n",
    "The goal of classfication was to learn a mapping from input $\\vec{x}$ to output y, where $y\\in \\lbrace 1,\\ldots C \\rbrace$, where $C$ is the number of classes. If C=2, we call it  binary classification. Otherwise, we call it multiclass classification.\n",
    "## Regression\n",
    "Regression is different from classification for that the response variable is continuous.\n",
    "\n",
    "# 3.Unsupervised learning\n",
    "Unsupervised learning was different from supervised learning for that it only has input data. The goal is to discover \"interesting pattern\" in the data. The task of unsupervised learning is one of density estimation. We want to build a model to model $p\\left(\\vec{x}_i | \\theta\\right)$. Different method of unsupervised learning is different at how to parameterize $p\\left(\\vec{x}_i | \\theta\\right)$.\n",
    "## Discovering clusters\n",
    "In discovering clusters, we assume that there is a latent category variable $z_i \\in \\lbrace 1,\\ldots, C \\rbrace$ with each sample in $D=\\lbrace (\\vec{x}_i) \\rbrace_{i=1}^N$. This is different from the supervised classification problems, because we can't \"see\" the latent category variable. So the actual joint distribution of $\\vec{x}$ is a marginal distribution of $p(\\vec{x},z|\\theta)$ as following $$p(\\vec{x}|\\theta)=\\sum_{c=1}^C p(\\vec{x}|\\theta,z_i=c)p(z_i=c)$$ Then we can estimate which cluster each point belongs to. We can calculate $z_i$ using Bayes law as $$ p\\left(z_i=c | \\vec{x}_i, \\theta \\right)=\\frac {p\\left(\\vec{x}_i | z_i=c, \\theta \\right) p\\left(z_i=c | \\theta\\right)}{p\\left(\\vec{x}_i | \\theta\\right)}$$\n",
    "## Discovering latent factors\n",
    "When dealing with high dimensional data, it's often useful to reduce the dimensionality by projecting the data to a lower dimensional subspace which captures the \"essence\" of the data, which is called **dimensionality reduction**. The motivation behind this technique is that although the data may appear high dimensional, there may only be a small number of degrees of variability, corresponding to **latent factors**. We assuming that the latent factors was $\\vec{z}$, then the actual joint distribution of $\\vec{x}$ is a marginal distribution of $p(\\vec{x},\\vec{z})$ as following.\n",
    "\\begin{equation*}\n",
    "\tp\\left(\\vec{x}|\\theta\\right)=\\int\\,p\\left(\\vec{x}|\\vec{z}\\right)p\\left(\\vec{z}\\right)d\\vec{z}\n",
    "\\end{equation*}\n",
    "This equation was similar to the discovering cluster, except that the latent factors is multi-dimention and continuous.\n",
    "\n",
    "# 4. Some basic concepts in machine learning\n",
    "## Parametrix vs non-parametric models\n",
    "**Parametrix** We use a special form of distribution to parameterize the model. Therefore, the number of parameters is not grew with the amount of training data, most of the practical models are parametrix. But we need make strong assumption on the form of the distribution. This is the most difficult part of Parametrix models. \n",
    "\n",
    "**Non-parametric** We make weak assumption on the distribution of data and we just **remember the data**. Inferring bases on the memory.The number of parameters is grow with the amount of training data. The non-parametric model is more flexible than parametric models. But it's not practical for large dataset.$K$-nearest neighbors(**KNN**) is one of the most simple non-parametric model.\n",
    "\n",
    "**Curse of dimensionality** With high dimensioanl data, the non-parametric models work not so well. The reason is that as the dimension of the data grows, the volume of the space increases so fast that the available data become sparse. In most non-parametric models, we need to search the \"neighbors\" which becomes hard for sparse data. This phenomena is called **curse of dimensionality**.\n",
    "\n",
    "## Overfitting\n",
    "With any training dataset $D=\\lbrace\\left(\\vec{x}_i,y_i\\right)\\rbrace_{i=1}^N$, we can always fit a model which performs very well on $D$ because we can use a highly flexible models. But fitting highly flexible models might lead to **overfitting** which means that the model performs very good on training dataset but performs bad on \"unseen\" data. The reason is that there are always noise in the dataset. Using flexible models might try to model these noises, which is unlikely suitable.\n",
    "\n",
    "## Model selection\n",
    "When facing with a practical machine learning problem, we can choose a lot of models of different complexity, range from linear regression to deep neural network, we should choose the model whose complexity is most matched with the complexity of the problem. The rule is to minimize the **generalization error** which is the accuracy calcualted on test data. The model which minimizes the **generalization error** is the one most match the problem. The **cross validation(CV)** is the most widely used method for model selection."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
